DATA ENGINEERING ARCHITECTURE WITH AZURE SERVICES
Vaishnavi Shankar Devadig, 2025
Data engineering is the craft of making data readable, reliable, fast, and usable so people can ship decisions and models with confidence. It is less about the shiny dashboards we see at the end and more about the plumbing: how the data was made available so that it could be visualized beautifully, analyzed properly, and used to answer some very real questions.
This project is a basic-level implementation of something similar to what I do as part of my work as a data engineer in the healthcare sector. There is a lot more I do, which obviously cannot fit into one project :)
GitHub Link: https://github.com/vdevadig/Data_Engineering_Framework_with_Azure
Technologies, Services and Languages Used:
- Azure Data Factory
- SQL Database (cloud-hosted MySQL database used)
- NoSQL Database (Cloud hosted MongoDB used)
- GitHub Repo
- Azure Databricks
- Azure Synapse Analytics
- Azure Data Lake Storage Gen2
- Other Azure services like App Registrations, IAM role assignments, and the Unity Catalog access connector for ADB
- PySpark, Python, SQL
Data Used:
- Drugs_product.csv
- Drugs_unfinished_products.csv
- Nhanes_2013_2014.csv
- Condition_Keyword_Map.csv (to perform join on the product datasets)
These datasets mostly cannot be joined directly. PySpark transformations have been applied to them as they move across the pipeline; however, for the final part of this implementation, I have used only one dataset.
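To make the role of the keyword map concrete, here is a minimal plain-Python sketch of the join idea. The column names (`PRODUCTNAME`, `KEYWORD`, `CONDITION`) and the sample rows are assumptions for illustration; the actual notebook does this with PySpark DataFrame joins.

```python
# Sample rows standing in for Drugs_product.csv and Condition_Keyword_Map.csv.
# All names/values here are made up for illustration.
products = [
    {"PRODUCTNAME": "Ibuprofen 200mg Tablets"},
    {"PRODUCTNAME": "Metformin HCl 500mg"},
]
keyword_map = [
    {"KEYWORD": "ibuprofen", "CONDITION": "pain/inflammation"},
    {"KEYWORD": "metformin", "CONDITION": "type 2 diabetes"},
]

def tag_conditions(products, keyword_map):
    """Attach a CONDITIONS list to each product whose name contains a mapped keyword."""
    tagged = []
    for p in products:
        name = p["PRODUCTNAME"].lower()
        conditions = [m["CONDITION"] for m in keyword_map if m["KEYWORD"] in name]
        tagged.append({**p, "CONDITIONS": conditions})
    return tagged
```

In PySpark the same idea becomes a join on the keyword column rather than a Python loop, which scales to the full datasets.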
Uploading data into multiple sources:
- Drugs_product.csv and Nhanes_2013_2014.csv are uploaded to GitHub Repo (Ideally should be a private repo)
- Drugs_unfinished_products.csv is uploaded to the MySQL DB
- Condition_Keyword_Map.csv is uploaded to MongoDB (as the Condition_Mapping_Index collection).
I have used filess.io to host my databases, and Python scripts to upload the data.
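The upload scripts boil down to reading a CSV and bulk-inserting its rows with parameterized SQL. Here is a hedged, runnable sketch of that logic; the real script targets the filess.io MySQL instance through a MySQL connector, but an in-memory SQLite database stands in below so the logic runs anywhere, and the table/column names are made up.

```python
import csv
import io
import sqlite3

# A tiny stand-in for Drugs_unfinished_products.csv (contents are made up).
sample_csv = io.StringIO(
    "PRODUCTID,PRODUCTNAME\n"
    "1,Ibuprofen 200mg Tablets\n"
    "2,Metformin HCl 500mg\n"
)

def load_csv_rows(fh):
    """Read a CSV file handle into a list of dicts keyed by header names."""
    return list(csv.DictReader(fh))

def bulk_insert(conn, table, rows):
    """Insert all rows with a parameterized statement (MySQL would use %s, not ?)."""
    cols = list(rows[0].keys())
    placeholders = ",".join("?" for _ in cols)
    sql = f"INSERT INTO {table} ({','.join(cols)}) VALUES ({placeholders})"
    conn.executemany(sql, [tuple(r[c] for c in cols) for r in rows])
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drugs_unfinished_products (PRODUCTID TEXT, PRODUCTNAME TEXT)")
rows = load_csv_rows(sample_csv)
bulk_insert(conn, "drugs_unfinished_products", rows)
count = conn.execute("SELECT COUNT(*) FROM drugs_unfinished_products").fetchone()[0]
```

Swapping the connection for a MySQL one (and `?` placeholders for `%s`) gives roughly the shape of the actual upload script.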
SQL DB instance
MongoDB instance
Azure Data Factory:
Data is ingested from two sources (the Git repo and the SQL DB) by an ADF pipeline into the bronze layer of our medallion architecture (hc-staging).
Since there is more than one CSV in the Git repo, a Lookup activity reads the filenames from a JSON file so the pipeline can iterate through them and retrieve each file via the HTTP linked service.
SQL data is retrieved by connecting to our MySQL DB instance using a MySQL linked service.
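For reference, the parameter file the Lookup activity iterates over might look like the fragment below. The filenames are the ones used in this project, but the key names (`files`, `filename`) are assumptions about how the JSON was structured:

```json
{
  "files": [
    { "filename": "Drugs_product.csv" },
    { "filename": "Nhanes_2013_2014.csv" }
  ]
}
```

A ForEach activity then feeds each `filename` into the relative URL of the HTTP linked service.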
JSON file parameters defined in ADF source
ADF run for files from external SQL DB and Git repo
Lookup Activity
HTTP Linked Service
MySQL Linked Service
Storage Account and Container
Staging layer after the pipeline run
Azure Databricks:
This starts with the creation of a Unity Catalog access connector to Databricks from Azure. The connector lets Databricks act as a managed identity that can be granted roles and permissions on the storage account.
Once that is done, we need to create a storage credential on the Databricks side and create an external location using the URL of the ADLS location.
One thing to note is that the owner of this external location has full control over who can access the data from the Databricks end as well.
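As a rough sketch, the Databricks SQL side of this looks like the DDL below, once a storage credential backed by the access connector exists (it can also be created through Catalog Explorer). All names here are placeholders, and `<storage-account>` stands for the actual account name:

```sql
-- Placeholder names; the storage credential is backed by the Azure access connector.
CREATE EXTERNAL LOCATION IF NOT EXISTS hc_staging_loc
  URL 'abfss://hc-staging@<storage-account>.dfs.core.windows.net/'
  WITH (STORAGE CREDENTIAL hc_access_connector_cred);

-- The external location's owner decides who can read or write through it.
GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION hc_staging_loc TO `data_engineers`;
```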
The GitHub link provided has the Databricks notebook I used to transform the data. The transformations are not too complicated, as the data was almost already clean. I have other projects in my portfolio where I have worked specifically on data cleaning and transformation.
Moving transformed data to the second layer:
Once the transformations are completed, we write the data into the silver layer in Parquet format, as it is far superior to CSV in terms of compression and encoding, and much better optimized for OLAP workloads.
Please note that I have only moved the unfinished products CSV into the silver layer for the sake of simplicity.
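The silver-layer write itself boils down to a single Parquet write against the external location. A small sketch follows; the container and account names are placeholders, and the Spark call is shown as a comment since it needs a cluster to run:

```python
def layer_path(container: str, account: str, relative_path: str) -> str:
    """Build an abfss:// URI for a medallion-layer container (placeholder names)."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{relative_path}"

silver_path = layer_path("hc-silver", "hcstorage", "drugs_unfinished_products")

# On the Databricks cluster, the actual write would be roughly:
# df_clean.write.mode("overwrite").parquet(silver_path)
```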
Azure Synapse Analytics:
Synapse is used in this project mainly to run analytics on the silver-layer data and to store any views we create as the gold layer.
This gold-standard, targeted data can be used by the stakeholders who come next in the process, whether they are analysts, data scientists, or business users.
While I could have used a Synapse workspace for the whole process (from ingestion to the end), it is better to keep the concerns separate.
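The gold-layer external table can be created with a CETAS-style statement over the silver Parquet files in the serverless pool. Below is a hedged sketch, assuming an external data source and a Parquet file format have already been defined; the object names and the column used in the aggregation are placeholders, not the project's actual definitions:

```sql
-- Placeholder names throughout (hc_datalake, parquet_format, column names).
CREATE EXTERNAL TABLE gold.drugs_unfinished_summary
WITH (
    LOCATION = 'hc-data/drugs_unfinished_summary/',
    DATA_SOURCE = hc_datalake,
    FILE_FORMAT = parquet_format
)
AS
SELECT PRODUCTNAME, COUNT(*) AS product_count
FROM OPENROWSET(
    BULK 'hc-silver/drugs_unfinished_products/*.parquet',
    DATA_SOURCE = 'hc_datalake',
    FORMAT = 'PARQUET'
) AS silver
GROUP BY PRODUCTNAME;
```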
Viewing and performing analytics on ASA
Creating a gold standard external table
Gold standard data in the final hc-data folder
Final Thoughts:
That brings us to the end! While this is a very basic framework, a lot more can be added on top of it to make it better.
For example, instead of running the Databricks notebook standalone, we could create another ADF pipeline that gets triggered once every hour to pick up all the data from the staging folder.
The pipeline could also include an ADB notebook-run activity and, once it succeeds, end with a Delete activity that removes all of the staging data.
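The hourly trigger for such a pipeline would be a standard ADF schedule trigger. A sketch of its JSON definition, with a hypothetical pipeline name:

```json
{
  "name": "hourly-staging-trigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": { "frequency": "Hour", "interval": 1 }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "pl_staging_to_silver",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```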
The ideas for making the flow closer to industry standards are endless; I might continue working on this to make it better!